Pi Autotune Embedded System

ECE 5725 Final Project
By: Rochelle Barsz (rsb359) and Alix Virgo (av522).


Demonstration Video


Objective

Pi Autotune is a pitch-adjusting device using the Raspberry Pi 4. There are three modes of operation: a tuner, recording and playback, and karaoke sing-along. Each mode uses ideas of frequency detection and adjustment to transform an audio signal into an in-tune recording.


Introduction

Pi Autotune is an embedded system that tunes voice and instrumentals using a Raspberry Pi 4 and a USB-connected microphone. The graphical interface on the PiTFT reveals the extent to which the input is flat, sharp, or spot-on, providing a sliding scale that displays the frequency difference between the input and the desired frequency. The user can switch between this default mode, submitting a recording to be autotuned, and singing along to a song, karaoke-style, which produces a version of the song in which the user sings at the proper pitch. Pi Autotune is based on mathematical formulation using the Fast Fourier Transform (FFT), sound frequency relations, and domain transformation to properly autotune any input.


Design

Code Organization

Our project uses the following Python libraries: PyGame to interface with the capacitive touchscreen, GPIO to interact with the physical buttons, and the ALSA utilities to record and play back WAV files. GPIO and PyGame must be run with the sudo command, but the ALSA “arecord” command failed when run with sudo. Because of this, the main.py script handles the majority of the code and is run with sudo, while audio.py performs the audio interfacing and runs in the background without sudo. Additionally, a third script, autotune_process.py, contains the functions called from main.py, organizing the relatively complicated autotune process.

Communication via FIFO

To communicate between the main and audio scripts, two FIFOs were created: fifo_to_main and fifo_to_audio. Each FIFO is used unidirectionally, despite FIFOs having bidirectional capabilities, for ease of use. Simple strings sent over the FIFOs tell the other script what to do. Because reading from a FIFO is blocking, main.py and audio.py run simultaneously but functionally execute one at a time, each waiting for input from the other. The main script can send the commands “cancel”, “aplay”, “arecord”, “arecord_short”, or “karaoke” to the audio script. The audio script sends an “OK” signal back to the main script to indicate it has completed the required command.
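The request/acknowledge handshake over the two FIFOs can be sketched as below. This is a minimal stand-in, not the project's code: the FIFO paths are temporary here, and the audio side runs as a thread rather than a separate script, but the blocking reads and the “OK” reply follow the scheme described above.

```python
import os
import tempfile
import threading

# Hypothetical paths; the project creates fifo_to_main and fifo_to_audio
# in its own working directory.
fifo_dir = tempfile.mkdtemp()
to_audio = os.path.join(fifo_dir, "fifo_to_audio")
to_main = os.path.join(fifo_dir, "fifo_to_main")
for path in (to_audio, to_main):
    os.mkfifo(path)

def audio_side():
    """Stand-in for audio.py: block until a command arrives, then reply OK."""
    with open(to_audio) as f:          # open() blocks until a writer appears
        cmd = f.readline().strip()
    with open(to_main, "w") as f:      # acknowledge completion
        f.write("OK\n")
    return cmd

received = {}
t = threading.Thread(target=lambda: received.setdefault("cmd", audio_side()))
t.start()

# main.py side: send a command, then block until the audio side answers.
with open(to_audio, "w") as f:
    f.write("arecord_short\n")
with open(to_main) as f:
    reply = f.readline().strip()
t.join()
```

Because every send is paired with a blocking read of the reply, the two scripts stay synchronized without any explicit locking.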

Main.py

The main file keeps track of the state and performs the corresponding tasks. Upon startup, the system is in "Tuner mode," where the microphone records 1-second intervals of ambient sound. The run_listen function from the autotune module is called to detect the fundamental frequency of this short recording. Each potential fundamental frequency can be mapped to its closest true value, where we define a true value as one of the 12 chromatic notes in each octave. The chromatic frequencies are related mathematically, allowing us to generate them as a vector with the equation below:

f_n = f_0 · 2^(n/12), where f_0 is the frequency of the lowest C and n = 0, 1, 2, … indexes the chromatic steps

The lowest frequency corresponds to note C, and each frequency will be assigned accordingly, mapping to the vector

notes = ['C','C♯/D♭','D','D♯/E♭','E','F','F♯/G♭','G','G♯/A♭', 'A','A♯/B♭','B'] * 9
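A minimal sketch of how this true-frequency vector and nearest-note lookup can be built with numpy. The base frequency of 16.35 Hz (C0) and the helper name closest() are our assumptions for illustration; the project's actual constants may differ.

```python
import numpy as np

C0 = 16.35  # Hz; assumed base frequency for the lowest C
n = np.arange(12 * 9)                 # 12 chromatic notes over 9 octaves
true_freqs = C0 * 2.0 ** (n / 12.0)   # f_n = f0 * 2^(n/12)

notes = ['C','C♯/D♭','D','D♯/E♭','E','F','F♯/G♭','G','G♯/A♭',
         'A','A♯/B♭','B'] * 9

def closest(freq):
    """Map a detected fundamental to the nearest true frequency and note name."""
    i = int(np.argmin(np.abs(true_freqs - freq)))
    return notes[i], true_freqs[i]
```

For example, a slightly flat fundamental of 438 Hz maps to the note A near 440 Hz, and the signed difference between the two is what the tuner scale displays.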

Given the closest possible frequency to the current fundamental frequency, the corresponding note is displayed on the screen. To exemplify the full tuner capability, we have a scale on the top of the screen that marks how far the fundamental frequency is from the desired frequency. If the arrow points directly in the center of the line, this corresponds to being off by 0 Hz, in other words the singer is spot on! If the arrow is to the right of 0, the singer is sharp, and if the arrow is to the left of 0, the singer is flat. The exact location of the arrow along the line depicts the frequency difference between the actual and desired frequencies in Hz.

tuner pic

The screen displays a start/stop recording button in the bottom left corner. When pressed from tuner mode, the system enters "recording mode", where the system records input sounds to a WAV file while displaying a blinking red circle at the top left of the screen. Recording will continue until the button is pressed again to end the recording. After this, the system automatically enters "playback mode", where the system autotunes the most recent WAV file, calling the run function in the autotune module, and plays the adjusted recording over the speakers.

record pic

Once playback is complete, the system will return to tuner mode. If the displayed quit button on the screen is pressed during recording or playback mode, the current process is terminated, and the system returns to tuner mode.

The last mode of our system is a “karaoke mode”, which is entered when the Sing button displayed in the bottom center of the screen is pressed. A short recording of 262 Hz, which is middle C, is played to give the user their starting frequency. The screen transitions to the next level where a countdown (“4”, “3”, “2”, “1”) is displayed on the screen to give the user the song’s tempo. Then, the screen displays the lyrics to the song “Twinkle Twinkle Little Star” and the user sings the tune along with the corresponding red-highlighted syllable to further reinforce the tempo. Both the tempo and frequency of the song are hardcoded into our autotune run_to_song function.

karaoke pic1

karaoke pic2

Finally, once the user has completed using Pi Autotune, they can press either the quit button or shutdown button and the final screen below is displayed.

final screen pic

To implement the functionality above, main.py must communicate with audio.py to record audio and detect frequency values. While the device is in tuner mode, the main script records one-second intervals by sending the command “arecord_short” to the audio script via FIFO. Once the 1-second recording finishes, main.py receives an “OK” signal in response. If recording mode is entered, the main script sends “arecord” to the audio script, where the device remains until an interrupt is sent. Pressing either the quit button or the stop-recording button sends a “cancel” to the audio script to terminate the command, and the main script then receives an “OK” once the audio recording has been saved as a WAV file. When the device enters playback mode, “aplay” is sent to the audio script; the device waits until the recording has finished playing and receives an “OK” in response once complete.

When the user begins karaoke mode, the main script sends “karaoke” to the audio script to begin that process. The main script then displays the countdown after waiting the appropriate time and waits for the audio script to finish the karaoke recording. When the karaoke recording is complete, the main script receives an “OK”, then begins the autotune process and sends the command “karaoke_aplay” to the audio script. This starts playback of the karaoke song, and the main script receives an “OK” when playback is complete.

Audio.py

The audio script consists of a while loop that runs in the background, waiting to receive data via FIFO from the main script. Checking the FIFO for data at every iteration is a blocking call, but all audio commands are self-contained and only one command is processed at a time. It uses the “arecord” and “aplay” commands from the Advanced Linux Sound Architecture (ALSA) utilities to record or play back a WAV file. All audio was sampled at 44100 Hz and all WAV files used 16-bit samples.

In the event “arecord_short” is received, which happens periodically if the device is in tuner mode, the arecord command is executed in the background to begin recording. At this point, the process ID is recorded by writing it to a text file. Then, after waiting for one second, the recording is terminated by running the kill command with the corresponding PID using a subprocess.
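The start/record-PID/kill pattern for “arecord_short” can be sketched as follows. The PID-file path and helper names are hypothetical, and we substitute a harmless `sleep 100` for the real `arecord` command so the sketch runs anywhere; the project would launch arecord with its output WAV path instead.

```python
import os
import signal
import subprocess
import tempfile
import time

# Hypothetical PID-file path; the project writes to its own directory.
PID_FILE = os.path.join(tempfile.gettempdir(), "record_pid.txt")

def start_background(cmd):
    """Launch a command in the background and note its PID in a text file."""
    proc = subprocess.Popen(cmd)
    with open(PID_FILE, "w") as f:
        f.write(str(proc.pid))
    return proc

def kill_recorded():
    """Terminate the process whose PID was saved earlier."""
    with open(PID_FILE) as f:
        pid = int(f.read())
    os.kill(pid, signal.SIGTERM)

# For "arecord_short": start recording, wait one second, then stop it.
# `sleep 100` stands in for the real arecord invocation here.
proc = start_background(["sleep", "100"])
time.sleep(1)
kill_recorded()
proc.wait()
```

Writing the PID to a text file (rather than keeping it in memory) is what lets a later “cancel” command kill a recording that was started by an earlier loop iteration.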

If the audio script receives “arecord” then the system has entered recording mode. Similar to “arecord_short”, a recording is begun and the PID of that command is recorded. However, the system will not terminate that recording on its own.

If the audio script receives “aplay” then the system has entered playback mode. The previous recording had already been converted to a WAV file, modified using the implemented autotune process, and converted back into a new WAV file. The aplay command is executed using a subprocess, and the code will only proceed once that recording is done. To provide feedback to the main script that playback mode can end, “OK” is sent via FIFO to the main script.

If the audio script receives “cancel”, the user has pressed either the quit button or the end-recording button. Regardless of what state the device is in, the audio script kills the current audio process, whose PID was recorded to a text file when recording started. The audio script then sends back an “OK” to the main script to synchronize timing.

When karaoke mode begins, the audio script will receive “karaoke”. First the audio script plays a brief recording of 262 Hz, then waits for the user to see the countdown and begin singing. The audio script then begins the arecord process and notes the PID. After waiting 10 seconds, the length of the recording, the audio script kills that process using the PID noted earlier, and an “OK” is sent back to the main script.

After the user has sung their karaoke part, the audio script will receive “karaoke_aplay”. This will be after the autotune process has finished on the karaoke recording, and the finished product will begin playing back. Once finished, an “OK” is echoed to the main script.

tuner pic

Autotune Implementation

The three modes of Pi Autotune have their own functions within our autotune module. In our analysis for all sections, we use a sampling frequency of 44.1 kHz and split each input WAV file into sections of 16384 samples, corresponding to 0.37 seconds of audio each. (We chose a sampling rate of 44.1 kHz because human frequency detection extends up to approximately 20 kHz, and by the Nyquist–Shannon sampling theorem we must use a sampling rate greater than twice this highest frequency.)

tuner pic

The tuner mode calls the function run_listen(), which analyzes the tuner_file.wav generated in the main code. Our first attempt at analyzing a WAV file used scipy.io.wavfile.read() to convert the input WAV file into a vector, which we would then split into our audio sections. This worked logically, but the Raspberry Pi does not have enough memory to store such a large array, even for a 1-second recording. As a result, we shifted to the built-in Python wave module, which lets us read the original WAV file in small sections of 16384 samples and convert each section into a numpy vector to be analyzed.
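The section-by-section reading with the wave module can be sketched as below, assuming 16-bit mono audio. The generator name and the self-generated test tone are ours for illustration, not the project's code.

```python
import os
import tempfile
import wave

import numpy as np

CHUNK = 16384  # samples per section (~0.37 s at 44.1 kHz)

def sections(path):
    """Yield successive 16384-sample sections of a 16-bit WAV as numpy arrays."""
    with wave.open(path, "rb") as wf:
        while True:
            frames = wf.readframes(CHUNK)
            if not frames:
                break
            yield np.frombuffer(frames, dtype=np.int16)

# Write a one-second 440 Hz test tone so the sketch is self-contained.
rate = 44100
t = np.arange(rate)
tone = (0.5 * 32767 * np.sin(2 * np.pi * 440 * t / rate)).astype(np.int16)
path = os.path.join(tempfile.gettempdir(), "test_tone.wav")
with wave.open(path, "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)      # 16-bit samples
    wf.setframerate(rate)
    wf.writeframes(tone.tobytes())

chunks = list(sections(path))
```

Because readframes() pulls only one section at a time, memory usage stays bounded regardless of how long the recording is.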

tuner pic

In each of the modes, we conduct waveform analysis on each section of the WAV file, and to produce an output recording, we concatenate the resulting recordings for each section into one large WAV file to be returned.

Our analysis occurs in the frequency domain, using the numpy rfft function to perform the conversion. By converting each section of the input WAV file to the frequency domain, we see which frequencies are most present in that section. We perform peak detection to find the lowest frequency among the highest-amplitude peaks in the frequency domain signal; this corresponds to the fundamental frequency of the input voice. Upon finding the fundamental frequency, we map it to the closest ‘true’ frequency, meaning one of the 12 chromatic notes across the 9 octaves. The true frequency is passed to the main function, along with the difference in frequency, to be displayed to the user on the screen.
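The frequency-domain analysis of one section can be sketched as follows. For simplicity this version takes the single strongest rfft bin (with a Hann window to reduce spectral leakage); the project's peak detection instead selects the lowest of the strong peaks, which is more robust when a harmonic outweighs the fundamental.

```python
import numpy as np

RATE = 44100
N = 16384  # samples per section

def fundamental(section, rate=RATE):
    """Estimate the fundamental frequency of one section from its rfft.
    Simplified: returns the strongest bin rather than the lowest strong peak."""
    windowed = section * np.hanning(len(section))   # reduce leakage
    spectrum = np.abs(np.fft.rfft(windowed))
    return np.argmax(spectrum) * rate / len(section)

# Demo: a 440 Hz tone with a weaker second harmonic at 880 Hz.
t = np.arange(N) / RATE
section = np.sin(2 * np.pi * 440 * t) + 0.3 * np.sin(2 * np.pi * 880 * t)
f0 = fundamental(section)
```

The estimate lands within one bin width (about 2.7 Hz) of the true 440 Hz fundamental, which is the granularity limit discussed below.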

As an aside, the number of samples per WAV file section sets the frequency granularity of our analysis. The frequency domain signal quantizes a continuous sweep of frequencies from 0 to half the sampling rate, so with 16384 samples per section the frequency signal contains frequencies in increments of 44100/16384 ≈ 2.7 Hz. Using fewer samples per section coarsens this granularity and limits the autotuning we can perform, while using more samples refines it. However, when we increased the number of samples per section to the next power of 2 past 16384, the Raspberry Pi was no longer able to handle the large array, so we chose 16384 samples per section as our upper bound.

The recording mode calls the autotune function run(), which begins by performing the same initial analysis as the tuner mode, calculating the fundamental frequency and the new desired frequency. Given the difference in Hz between the fundamental and desired frequencies, the entire frequency spectrum is shifted by this difference. This is a valid approach because, in the vocal range, neighboring chromatic notes are only 20-50 Hz apart, meaning the input can be at most 10-25 Hz away from the desired frequency. Shifting the entire frequency spectrum approximately preserves the harmonic properties of the human voice, namely that each harmonic is an integer multiple of the fundamental frequency: after a shift of only 10-25 Hz, each harmonic remains approximately an integer multiple of the fundamental. Difficulty arises in the autotuning of a karaoke input, discussed in the next section.
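The whole-spectrum shift can be sketched as below, assuming the frequency offset has already been measured. The function name and the zeroing of wrapped-around bins are our choices for illustration, not necessarily the project's exact handling; note the shift is rounded to a whole number of rfft bins (about 2.7 Hz each for 16384-sample sections).

```python
import numpy as np

def shift_pitch(section, delta_hz, rate=44100):
    """Shift every frequency component of one section by delta_hz."""
    spectrum = np.fft.rfft(section)
    bins = int(round(delta_hz * len(section) / rate))
    shifted = np.roll(spectrum, bins)
    # Zero the bins that wrapped around the ends of the spectrum.
    if bins > 0:
        shifted[:bins] = 0
    elif bins < 0:
        shifted[bins:] = 0
    return np.fft.irfft(shifted, n=len(section))

# Demo: nudge a slightly flat 430 Hz tone up by 10 Hz toward A440.
sec = np.sin(2 * np.pi * 430 * np.arange(16384) / 44100)
out = shift_pitch(sec, 10.0)
```

Rolling the rfft bins moves the fundamental and every harmonic by the same absolute offset, which is exactly the approximation the paragraph above justifies for small corrections.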

tuner pic

Karaoke mode uses the autotune function run_to_song(), which takes an input song WAV file. Each syllable of the song is represented in one WAV file section, so it is assumed that the singer has aligned with the proper tempo as depicted in the karaoke interface. We additionally hardcode the song frequencies, starting with a middle C, at 262 Hz, which was played for the singer prior to recording. Similar to the tuner and recording modes, this function detects the fundamental frequency for each WAV file section. Instead of mapping this frequency to the nearest desired frequency, we set the desired frequencies manually according to the current placement in the song. This introduces the difficulty hinted at previously in shifting the frequency spectrum so that the result is a decent-sounding recording of ‘Twinkle Twinkle Little Star’ at the correct frequencies. The difference between the fundamental frequency and the hardcoded song frequency may be very large depending on the singing abilities of the user. If we shift the entire frequency spectrum by this change in frequency, it is no longer true that each harmonic is an integer multiple of the fundamental frequency. As a result, we must employ a slightly more sophisticated method to adjust the frequency spectrum.

tuner pic

We compute the fundamental frequency and roughly 7 harmonic frequencies, as well as the new desired fundamental and harmonic frequencies. The gap between adjacent harmonics may be smaller or larger than the original gap, depending on whether the user sang higher or lower than the desired frequency, respectively. For each gap between harmonics, we map all of the frequency information from the original domain to the desired domain with the following formula:

new_point = ratio * (x-start) + new_start

where the ratio is defined as the new range divided by the original range, and start and new_start refer to the original starting frequency and the new starting frequency, respectively, for the current harmonic gap.

If the gap between the harmonics becomes smaller, some information will be lost here. If the gap between the harmonics increases, we interpolate using numpy.interp to fill in the remaining points.
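The per-gap remapping and interpolation can be sketched as below on a toy spectrum with one bin per Hz. The function name and band-edge handling are our assumptions; the formula inside is the new_point mapping given above.

```python
import numpy as np

def remap_band(mag, start, stop, new_start, new_stop, resolution):
    """Map the magnitude bins in the harmonic gap [start, stop) onto the
    gap [new_start, new_stop), stretching or shrinking as needed."""
    old_bins = np.arange(int(start / resolution), int(stop / resolution))
    ratio = (new_stop - new_start) / (stop - start)
    # new_point = ratio * (x - start) + new_start, applied per bin frequency
    new_freqs = ratio * (old_bins * resolution - start) + new_start
    new_bins = np.arange(int(new_start / resolution), int(new_stop / resolution))
    # Interpolate so every destination bin gets a value, even when the gap grows.
    return np.interp(new_bins * resolution, new_freqs, mag[old_bins])

# Demo: double the width of the 10-20 Hz gap, mapping it onto 20-40 Hz.
spectrum = np.arange(100.0)            # toy magnitude spectrum, 1 Hz per bin
stretched = remap_band(spectrum, 10, 20, 20, 40, 1.0)
```

When the gap shrinks instead (new range narrower than the old), several source bins collapse onto each destination bin and the interpolation discards the excess, which is the information loss noted above.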

This describes the autotune procedure for each of the three modes of Pi Autotune, each building on the prior with the overall goal of providing tuning information in the form of frequencies or recordings through frequency detection and adjustment.

tuner pic

Start_scripts Shell Script

The scripts main.py and audio.py must run in parallel so that the aforementioned communication can take place. The only script the user runs is the bash script start_scripts, which launches audio.py in the background, where it waits for commands from main.py, and then runs main.py in the foreground. When the background audio script starts, we note its process ID number so that, if the quit or shutdown button is pressed, the background script exits properly by killing the process with that number.

The last step was to have Pi Autotune function as an embedded device, disconnecting the Pi from connections other than the microphone and speakers. We attempted to use the crontab method by inserting the following command into the crontab configuration:

@reboot /home/pi/final_project/start_scripts

Each of our main, audio, and autotune files begins by changing the working directory to /home/pi/final_project, so that all audio recordings are placed and manipulated there. However, when start_scripts was launched by crontab, it ran from a directory outside /home/pi, and we saw lagging and freezing behavior from the PiTFT interface. As an alternative, we used the bashrc method: we appended the command /home/pi/final_project/start_scripts to the end of the .bashrc script. We also enabled auto-login on the Raspberry Pi so that no login was needed on reboot. This method worked, our programs ran smoothly on system reboot, and we were able to use Pi Autotune as an embedded system.

Hardware

This project runs on a Raspberry Pi 4's Linux operating system. The RPi connects to a capacitive touchscreen, the PiTFT, which provides an interface for a user to give inputs to the embedded system. A microphone connects to the system over USB, and a speaker over the 3.5 mm headphone jack.

The PiTFT contains GPIO buttons, of which two were used. The first button, GPIO 27, serves as a quit button that exits the code and stops the program. Additionally, GPIO 22 acts as a shutdown button.

record pic

Results

Below are sample input and output audio recordings.

Recording 430Hz and Autotune Adjustment to 440Hz

General Recording and Autotune Adjustment

Karaoke Recording and Autotune Adjustment

The initial goals outlined for this project were exceeded. Initially, the design of Pi Autotune consisted only of the tuner, recording, and playback modes. One of the biggest issues encountered was that Pi Autotune couldn’t run in a single script, as the ALSA commands didn’t behave as expected when run with sudo. Figuring out FIFO communication wasn’t in the original scope but was developed later out of necessity. Additionally, it became evident during the project that there was time to add more to Pi Autotune, so the karaoke mode was designed and implemented.

The addition of the karaoke mode added complexity to the autotune process that didn’t exist in the recording/playback mode. Originally a voice would only be shifted by a relatively small amount, since it was approximated to the nearest chromatic note. Karaoke mode, however, was intended to map any initial pitch to a set of fixed desired pitches. This introduced the difficulties of the human voice: the harmonics of the fundamental frequency must also be adjusted properly, so the output spectrum had to be stretched or shrunk to a new size.


Conclusions

The result of the embedded device and autotune process is more than sufficient to satisfy the initial requirements. Frequency detection during tuner mode was determined to be accurate, and the recording/playback mode produced an output waveform that sounded true to the expected autotune result. Karaoke mode sounded more distorted due to shifting across a much larger frequency range, but was still adequate in approximating the autotune process. Altogether the autotune process was completed.

For this project, we successfully created two scripts that run on startup through execution of a bashrc script, as simply starting the scripts directly from crontab didn’t work. The scripts communicate correctly over FIFOs in accordance with the timing constraints, without losing functionality. The project successfully converts a time-domain WAV file to the frequency domain for manipulation, after the initial method of converting the WAV file to a single vector failed due to memory constraints, and later converts it back into a WAV file for playback.


Future Work

The human voice is a complex phenomenon and we are by no means experts! This project exposed the complexities of adjusting human pitch, in that any manipulation requires careful consideration of the fundamental frequency as well as the harmonics of voice. If the mathematical relations between harmonic frequencies are not upheld, the human voice loses its characteristics.

Processing large audio files has high complexity, especially on the Raspberry Pi with little RAM and compute power. Next steps are to look into parallel computing and increasing RAM to analyze larger WAV files.

We additionally could add more modes to Pi Autotune, including a wider selection of songs in karaoke mode, and further user input in the autotune process.

Future work on this project is to further our knowledge of the physical properties of the human voice to improve our recording and karaoke mode analyses. While the current results shift frequencies properly, this project can be a source for continuous optimization in terms of authenticity and efficiency.


Work Distribution

Rochelle Barsz

rsb359@cornell.edu

record pic

Implemented autotune

Wrote the code for audio processes

Implemented the graphical interface and main code

Alix Virgo

av522@cornell.edu

record pic

Designed graphical interface

Established two script code organization with FIFO communications


Parts List

Total: $105.86


References

ECE5725 Voice Changer Final Project
ECE5725 Real Guitar Hero Final Project
How to Record Audio with the Raspberry Pi
PiAudio Documentation
Numpy FFT Library
Python Read and Write WAV Files

Code Appendix

Code is uploaded in our git repository: https://github.coecis.cornell.edu/rsb359/ECE5725-PiAutotune